- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.7k
More decimal 32/64 support - type coercsion and misc gaps #17808
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @AdamGS
| DataType::Decimal32(..) => compare_value!(Decimal32Array), | ||
| DataType::Decimal64(..) => compare_value!(Decimal64Array), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I am curious why 256 is omitted? Perhaps we can add it in if there's no compiler error doing so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll add it. Through this work I've noticed a few places where it was omitted, but I've yet to find a reason or even a failing test that would explain it, usually its just a matter of adding a branch to a match statement like here.
| Int8 | Int16 | Int32 | Int64 | Float32 | Float64 | Decimal128(_, _) | ||
| Int8 | Int16 | ||
| | Int32 | ||
| | Int64 | ||
| | Float32 | ||
| | Float64 | ||
| | Decimal32(_, _) | ||
| | Decimal64(_, _) | ||
| | Decimal128(_, _) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can use is_signed_numeric() here potentially (though that brings in Float16)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems like its probably fine, also talked to my local Spark expert and seems like it should be fine (+ I assume there are other places where we deal with the conversion to spark's type system)
| let s = s1.max(s2); | ||
| let range = (p1 as i8 - s1).max(p2 as i8 - s2); | ||
| let required_precision = (range + s) as u8; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if we have:
Decimal256 with precision 76 (max) and scale 0, and Decimal128 with precision 38 (max) with scale 1;
So s = 1, range = 76, required_precision = 76 + 1 -> overflow?
Is this a valid case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think an overflow is valid, I'll have to think about it and maybe look into solutions in other systems.
We can also just return None in that case, which should force the user to add an explicit cast to one side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked around a bit, and what I could find is:
- DataFusion already has multiple issues regarding cast overflow/precision loss (decimal calculate overflow but not throw error #16406, Datafusion downcasts decimal loosing precision #13492), which I'm happy to take on but are unrelated here.
- Spark (which seems to be the main inspiration for this code) has a configuration to control how it handles these cases (here and here).
I'm not sure what's the desired behavior regarding precision loss (should it be configurable? Is there currently an accepted desired behavior?), I think for this PR it should be fine to just return None if the precision overflows, and take the bigger conversation into an issue where people can weigh in, and I'll be glad to take that forward. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think returning None in cases like this for this PR is fine 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done as part of fd1f043
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cheers; left another minor comment related to the check below. Also would be nice if we had a test for this edge case.
62eb314    to
    b4c7400      
    Compare
  
    Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
b4c7400    to
    fd1f043      
    Compare
  
    | // We currently don't handle cases where the required precision overflows | ||
| if required_precision > DECIMAL256_MAX_PRECISION { | ||
| return None; | ||
| } | ||
|  | ||
| // Choose the larger variant between the two input types | ||
| match (lhs_type, rhs_type) { | ||
| (Decimal32(_, _), Decimal64(_, _)) | (Decimal64(_, _), Decimal32(_, _)) => { | ||
| Some(Decimal64(required_precision, s)) | ||
| } | ||
| (Decimal32(_, _), Decimal128(_, _)) | (Decimal128(_, _), Decimal32(_, _)) => { | ||
| Some(Decimal128(required_precision, s)) | ||
| } | ||
| (Decimal32(_, _), Decimal256(_, _)) | (Decimal256(_, _), Decimal32(_, _)) => { | ||
| Some(Decimal256(required_precision, s)) | ||
| } | ||
| (Decimal64(_, _), Decimal128(_, _)) | (Decimal128(_, _), Decimal64(_, _)) => { | ||
| Some(Decimal128(required_precision, s)) | ||
| } | ||
| (Decimal64(_, _), Decimal256(_, _)) | (Decimal256(_, _), Decimal64(_, _)) => { | ||
| Some(Decimal256(required_precision, s)) | ||
| } | ||
| (Decimal128(_, _), Decimal256(_, _)) | (Decimal256(_, _), Decimal128(_, _)) => { | ||
| Some(Decimal256(required_precision, s)) | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // We currently don't handle cases where the required precision overflows | |
| if required_precision > DECIMAL256_MAX_PRECISION { | |
| return None; | |
| } | |
| // Choose the larger variant between the two input types | |
| match (lhs_type, rhs_type) { | |
| (Decimal32(_, _), Decimal64(_, _)) | (Decimal64(_, _), Decimal32(_, _)) => { | |
| Some(Decimal64(required_precision, s)) | |
| } | |
| (Decimal32(_, _), Decimal128(_, _)) | (Decimal128(_, _), Decimal32(_, _)) => { | |
| Some(Decimal128(required_precision, s)) | |
| } | |
| (Decimal32(_, _), Decimal256(_, _)) | (Decimal256(_, _), Decimal32(_, _)) => { | |
| Some(Decimal256(required_precision, s)) | |
| } | |
| (Decimal64(_, _), Decimal128(_, _)) | (Decimal128(_, _), Decimal64(_, _)) => { | |
| Some(Decimal128(required_precision, s)) | |
| } | |
| (Decimal64(_, _), Decimal256(_, _)) | (Decimal256(_, _), Decimal64(_, _)) => { | |
| Some(Decimal256(required_precision, s)) | |
| } | |
| (Decimal128(_, _), Decimal256(_, _)) | (Decimal256(_, _), Decimal128(_, _)) => { | |
| Some(Decimal256(required_precision, s)) | |
| } | |
| // Choose the larger variant between the two input types | |
| match (lhs_type, rhs_type) { | |
| (Decimal32(_, _), Decimal64(_, _)) | (Decimal64(_, _), Decimal32(_, _)) if required_precision <= DECIMAL64_MAX_PRECISION => { | |
| Some(Decimal64(required_precision, s)) | |
| } | |
| (Decimal32(_, _), Decimal128(_, _)) | (Decimal128(_, _), Decimal32(_, _)) if required_precision <= DECIMAL128_MAX_PRECISION => { | |
| Some(Decimal128(required_precision, s)) | |
| } | |
| (Decimal32(_, _), Decimal256(_, _)) | (Decimal256(_, _), Decimal32(_, _)) if required_precision <= DECIMAL64_MAX_PRECISION => { | |
| Some(Decimal256(required_precision, s)) | |
| } | |
| (Decimal64(_, _), Decimal128(_, _)) | (Decimal128(_, _), Decimal64(_, _)) if required_precision <= DECIMAL128_MAX_PRECISION => { | |
| Some(Decimal128(required_precision, s)) | |
| } | |
| (Decimal64(_, _), Decimal256(_, _)) | (Decimal256(_, _), Decimal64(_, _)) if required_precision <= DECIMAL64_MAX_PRECISION => { | |
| Some(Decimal256(required_precision, s)) | |
| } | |
| (Decimal128(_, _), Decimal256(_, _)) | (Decimal256(_, _), Decimal128(_, _)) if required_precision <= DECIMAL64_MAX_PRECISION => { | |
| Some(Decimal256(required_precision, s)) | |
| } | 
Do we need to account for the other decimal types like so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add some tests to make sure what's the correct thing here is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done as part of 4145a04
| let s = s1.max(s2); | ||
| let range = (p1 as i8 - s1).max(p2 as i8 - s2); | ||
| let required_precision = (range + s) as u8; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cheers; left another minor comment related to the check below. Also would be nice if we had a test for this edge case.
| pub fn decimal_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> { | ||
| use arrow::datatypes::DataType::*; | ||
|  | ||
| match (lhs_type, rhs_type) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be cleaner like so:
/// Decimal coercion rules.
pub fn decimal_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
    use arrow::datatypes::DataType::*;
    // Prefer decimal data type over floating point for comparison operation
    match (lhs_type, rhs_type) {
        // Same decimal types
        (lhs_type, rhs_type)
            if std::mem::discriminant(lhs_type) == std::mem::discriminant(rhs_type) =>
        {
            get_wider_decimal_type(lhs_type, rhs_type)
        }
        // Mismatched decimal types
        (lhs_type, rhs_type)
            if is_decimal(lhs_type)
                && is_decimal(rhs_type)
                && std::mem::discriminant(lhs_type)
                    != std::mem::discriminant(rhs_type) =>
        {
            get_wider_decimal_type_cross_variant(lhs_type, rhs_type)
        }
        // Decimal + non-decimal types
        (Decimal32(_, _) | Decimal64(_, _) | Decimal128(_, _) | Decimal256(_, _), _)
        | (_, Decimal32(_, _) | Decimal64(_, _) | Decimal128(_, _) | Decimal256(_, _)) => {
            get_common_decimal_type(lhs_type, rhs_type)
        }
        (_, _) => None,
    }
}Following what was done above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops forgot the is_decimal() checks for the first branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I've added them locally :) should have something soon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done as part of 4145a04
| Added tests to check for precision overflow. This really brings things close to making  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
* More small decimal support * CR comments Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com> * Add tests and cleanup some code --------- Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
* More small decimal support * CR comments Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com> * Add tests and cleanup some code --------- Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
Which issue does this PR close?
This is a followup of #17501 and #17489 (Should I open a dedicated issue?).
The changes are taken out of #17759 as the SQL-facing changes ended up making more of an impact than I expected, and should probably be discussed in #17747.
Rationale for this change
Improve support for working with decimal columns of different width
What changes are included in this PR?
The biggest change in this PR is that type coercion for binary expression now not only supports the new decimal types, but it will try and extend one side into the other, which would just fail right now.
Minor changes
abs(deicmal32/64)Are these changes tested?
Additional coercion tests
Are there any user-facing changes?
No public API changes, but more combination of operations should work.